A Query Engine for Retrieving Information from Chinese HTML Documents

Authors

  • Lihua Zhang
  • Yiu-Kai Ng
Abstract

The amount of online information in Chinese and the number of Chinese Internet users have increased tremendously over the past decade. Because the Chinese language differs significantly from English, techniques developed for retrieving information from English Web documents cannot be applied directly to Chinese Web documents. To provide high-performance access to Chinese information on the Web, we have developed a Chinese Web query engine that (i) extracts (hierarchical) data of interest from Chinese HTML tables using an information extraction tool called semantic hierarchy, (ii) allows the user to submit queries in Chinese through a menu-driven user interface, and (iii) processes the user’s queries (as Boolean expressions) to generate the correct results. Our query engine supports information organized into subject areas such as car ads, house rentals, job ads, stocks, and university catalogs. We have tested our information extraction tool on two application domains, car ads and house rentals. The average F-measure for extracting Chinese data from these two domains is above 90%. More importantly, our query engine can easily be configured and internationalized to become a worldwide, multilingual query engine with minor changes to system settings on PCs running Windows operating systems.
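The abstract does not describe how the engine evaluates queries internally. As a rough illustration of step (iii) only, the sketch below treats each extracted table row as a dictionary of Chinese attribute names and values and composes a Boolean query from small predicates. The record layout, the sample car ads, and every function name here are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical representation: one extracted table row = one dict of strings.
Record = Dict[str, str]
Predicate = Callable[[Record], bool]


def field_equals(field: str, value: str) -> Predicate:
    """Condition that is true when the record's field equals the given value."""
    return lambda rec: rec.get(field) == value


def field_at_most(field: str, limit: float) -> Predicate:
    """Condition that is true when the field parses as a number <= limit."""
    def check(rec: Record) -> bool:
        try:
            return float(rec[field]) <= limit
        except (KeyError, ValueError):
            return False  # missing or non-numeric values never match
    return check


def all_of(*preds: Predicate) -> Predicate:
    """Boolean AND over any number of conditions."""
    return lambda rec: all(p(rec) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Boolean OR over any number of conditions."""
    return lambda rec: any(p(rec) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Boolean NOT of a single condition."""
    return lambda rec: not pred(rec)


def run_query(records: List[Record], query: Predicate) -> List[Record]:
    """Return the records that satisfy the Boolean expression."""
    return [rec for rec in records if query(rec)]


# Made-up car ads (brand, color, price), standing in for extracted data.
CAR_ADS: List[Record] = [
    {"品牌": "丰田", "颜色": "红色", "价格": "85000"},
    {"品牌": "本田", "颜色": "黑色", "价格": "120000"},
    {"品牌": "丰田", "颜色": "黑色", "价格": "60000"},
]

# (brand is 丰田 OR brand is 本田) AND NOT color is 红色 AND price <= 100000
query = all_of(
    any_of(field_equals("品牌", "丰田"), field_equals("品牌", "本田")),
    negate(field_equals("颜色", "红色")),
    field_at_most("价格", 100000),
)

results = run_query(CAR_ADS, query)
for rec in results:
    print(rec)
```

A menu-driven interface such as the one the abstract mentions could, under these assumptions, map each menu selection to one predicate and combine the selections with `all_of`, `any_of`, and `negate`.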

Similar resources

A Library of Document Analysis Components

Extracting information from high quality documents is one of the challenges for intelligent agents. A typical user query that such an agent might have to answer is: “Find a hotel in Mallorca that is close to the sea, I also want a photo of the hotel”. The answer to the query might be found in a brochure from a travel agency and answering it involves many steps, one of which is to analyse the con...

Searching and Generating Authoring Information: A Hybrid Approach

In this paper, the authors propose a novel approach to search and retrieve authoring information from online authoring databases. The proposed approach combines keywords and semantic-based methods. In this approach, the user can retrieve such information considering some specified keywords and ignore how the internal semantic search is being processed. The keywords entered by the user are inter...

A Prototype System for Retrieving Dynamic Content

With the advances in web technologies, web pages are no longer confined to static HTML files that provide direct content. This leads to more interactivity of web pages and at the same time to ignoring a significant part of the Web by search engines (or web crawlers) due to their inability to analyze and index most dynamic web pages. In this paper, we present a prototype system for retrieving dy...

A Unified Approach to Retrieving Web Documents and Semantic Web Data

The Semantic Web seems to be evolving into a property-linked web of RDF data, conceptually divorced from (but physically housed in) the hyperlinked web of HTML documents. We discuss the Unified Web model that integrates the two webs and formalizes the structure and the semantics of interconnections between them. We also discuss the Hybrid Query Language which combines the Data and Information R...

Categorisation by Context

Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material on the Web, and therefore it will soon be necessary to resort to techniques for automatic c...

Journal:
  • Int. J. Comput. Proc. Oriental Lang.

Volume 17

Publication date: 2004